In the fall of 2022, I got to TA a course called Phylogenetics and the Fossil Record, taught by my PhD advisor, Graham Slater. Although I TA’ed almost every single quarter during my time at the University of Chicago, this particular experience was pretty unique, for a number of reasons. First, it was a mixed undergraduate/graduate course, whereas all the other courses I TA’ed were offered exclusively to undergrads, and often specifically to non-science majors. Second, it was the only course I ever TA’ed that was directly and specifically related to my own research interests. Third, and related to the previous point, my responsibilities went a little further than they usually would. I wasn’t just in charge of grading and leading discussion or problem-set sessions: I was responsible for running a weekly two-hour lab where the students would get some hands-on practice with the material they’d learned about in their lectures, and Graham gave me a lot of freedom when it came to deciding how these labs should be organized and crafting the relevant exercises. This freedom extended to the choice of the software to be used; more on that below. Despite Graham’s and my best efforts, we had to make some adjustments to the originally planned course schedule, so on a few occasions, I even ended up having to introduce foundational concepts in advance of the corresponding lectures, such as the theory behind MCMC.
In contrast to molecular phylogenetics, where there is little interest in individual characters (at the end of the day, a nucleotide is a nucleotide) and dataset assembly is typically automated to a much greater extent, morphological phylogenetics places a lot more emphasis on the process of creating the dataset by scoring features that exhibit informative variation.1 Therefore, the very first lab of the course was devoted precisely to this process, with a fun spin: Graham had students assemble their own character matrices for several different types of pasta, with orzo as the outgroup.2 (The idea of making a phylogeny of foodstuffs seems to be either a pedagogical homoplasy or an instance of horizontal meme transfer – see another example by Josh Drew here.)
Not necessarily what you expect to find when preparing to teach a phylogenetics class.
This was a really cool way of showing the students how morphological phylogenetic datasets are assembled, and teaching them about the importance of character coding, with great discussions taking place in different breakout groups about whether a given character should be treated as discrete or continuous, ordered or unordered, etc. I felt a little more ambivalent about our decision to stick with the pasta toy example for the rest of the course: every time a new method was introduced, the corresponding homework assignment would have the students apply it to their own pasta dataset. This worked great for parsimony, where you can have a lot of fun with inferring synapomorphies for particular nodes (extreme elongation evolving in the last common ancestor of spaghetti and linguine, etc.), but it got a little too cutesy for its own good when we got to tip-dating, which involves quantities like fossil ages and extinction rates that have no reasonable interpretation unless you are dealing with things that actually die out at different points in time and leave behind remains, like biological species or languages. In this latter case, I found it easier to have the students analyze a real dataset. Graham picked the one from Tedford et al.’s (2009) analysis of North American canids for this purpose, which I also used as an example when demoing the different methods.
Before each lab, I would upload a PDF “handout” to Canvas for the students to read prior to our meeting, though this wasn’t essential, and remained mostly aspirational anyway. The handout would explain key concepts and show how to implement them in the program we were about to work with, and I would use it as a guide for our lab activity. Interspersed throughout the handout, there would be exercises that the students were supposed to complete on their own and turn in the following week – usually, these would ask them to modify the provided code in some way, if only to run it on their own dataset. The idea behind explaining every line of code in detail was to allow the students to refer back to the handout as they were working their way through the exercises.
As with many similar courses, we covered the three main contemporary methods for inferring phylogenies from discrete characters, i.e., parsimony, maximum likelihood, and Bayesian inference. Unlike the other two methods, maximum likelihood is seldom used in fossil phylogenetics, so one could perhaps feel justified in skipping it, but from a pedagogical standpoint, it represents a very convenient bridge between parsimony (given their shared goal of optimizing a particular criterion, accomplished in practice by heuristic searches of tree space) and Bayesian inference (given their shared reliance on the likelihood function). Back in 2020, when the course was TA’ed by my former colleague Anna Wisniewski, PAUP* was used for parsimony, RAxML for maximum likelihood, MrBayes for time-free Bayesian inference, and BEAST 2 for tip-dating. Two years later, some changes to this list were in order. RAxML was always rather inflexible when dealing with morphological data, and by 2022, it had become legacy software – no longer under active development, and replaced by RAxML-NG. The latter program has many features to commend it, and I’ve happily used it myself for combined analyses of molecular and morphological data, but it unfortunately doesn’t implement a substitution model for ordered characters out of the box. (The old RAxML did implement it, but incorrectly.) I’m sure this feature is only missed by a tiny percentage of RAxML-NG users, and you can apparently get around its absence by manually specifying the appropriate instantaneous rate matrix yourself, which is the same way ordered characters had to be handled in BEAST 2 when I last used it (though in RAxML-NG, this can apparently lead to some pesky underflow issues). 
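To make the workaround concrete: for an ordered character, changes are only allowed between adjacent states, so the instantaneous rate matrix \(Q\) is tridiagonal, with rows summing to zero. A quick Python sketch of what such a matrix looks like (my own illustration, not tied to any particular program’s input format):

```python
def ordered_q(k):
    """Build an (unnormalized) rate matrix for a k-state ordered character:
    only transitions between adjacent states get a nonzero rate."""
    q = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in (i - 1, i + 1):  # only the immediate neighbors of state i
            if 0 <= j < k:
                q[i][j] = 1.0
        q[i][i] = -sum(q[i])      # rows of a rate matrix must sum to zero
    return q

for row in ordered_q(4):
    print(row)
```

The same tridiagonal structure is what one would have to type out by hand when supplying a user-defined matrix to RAxML-NG or (in my experience) to BEAST 2.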
However, when analyzing real-world morphological data, being able to treat some but not all multistate characters as ordered is kind of a big deal, so I designed the maximum-likelihood lab around IQ-TREE, which makes this easy (with one important caveat; see below) and which I already had experience applying to morphological data as part of the analyses that ended up published as Černý & Simonoff (2023).
With regard to Bayesian inference, it seemed like overkill to me to introduce a new program every week, especially since the learning curve tends to be quite steep for each one of them. MrBayes had two important advantages: it can do both time-free inference and tip-dating, and the analysis setup (by means of setting a limited number of predefined parameters either interactively or in a separate block at the end of one’s Nexus file) resembles PAUP* quite a bit, allowing the students to build on what they learned earlier in the course. However, like RAxML, MrBayes is legacy software, which is both a blessing and a curse. It means that it is perhaps less likely to crash with a cryptic error message than competing programs like BEAST 2 and RevBayes, which are under active development and which allow users to freely assemble their own models by mixing and matching a large number of building blocks. However, it also means that things that were once supposed to work no longer do: as I recently found out, MrBayes can’t be used together with the BEAGLE library on ARM64 processors, which was not a concern in 2012 when the last minor release (3.2) came out, but which is a concern now that Apple has switched to this architecture for its laptops and desktops. Most importantly, it simply means that the program will be less and less useful going forward, as users increasingly abandon it for exciting new models and features implemented elsewhere.
In contrast, BEAST is a mature piece of software which is nevertheless still under active development, and which even comes with a GUI called BEAUti that should, in theory, make it much easier to use than the alternatives. Unfortunately, by design, BEAST only infers time trees, so it doesn’t support the full range of use cases we wanted to cover in the course. Moreover, my personal impression is that the GUI doesn’t quite live up to its promise. In my own BEAST 2 tip-dating analyses, I always found it necessary to go back to the XML that BEAUti generated for me and edit it manually (e.g., to treat certain characters as ordered, as mentioned above), which could be quite intimidating. A BEAST XML can be a beast of a file (easily reaching several megabytes in size when setting up total-evidence analyses that include molecular alignments), and it doesn’t clearly separate the data from the modeling choices (priors and parameters) and inference settings (sampler type, chain length, logging options, etc.). The BEAST 2 development team seems to be well aware of these problems, as shown by their recent proposal for a high-level model specification language called LinguaPhylo (Drummond et al. 2023), and by a recently circulated survey raising the prospect of a backward-incompatible “BEAST 3” that entirely abandons the XML format in favor of such a language. Of note here is the fact that LinguaPhylo is markedly similar to Rev, the model specification language used by RevBayes, so for now, the simplest solution is perhaps to use the latter program.
Indeed, in the fall of 2022, RevBayes seemed like the logical choice for the course Graham and I wanted to teach. Unlike BEAST, RevBayes can handle time-free inference as well as tip-dating, and unlike MrBayes, it is under active, community-driven development. It also helped that by the time I was asked to TA Phylogenetics and the Fossil Record, I’d already had some – though still quite rudimentary – experience as a RevBayes developer: I had implemented a new supertree method in it for chapter 1 of my dissertation, and was slowly becoming familiar enough with the code base to start contributing bug fixes. What sealed the deal was the fact that after covering tree estimation in detail, we planned to dedicate the last lab of the course to things one can do with trees, such as trait evolution and/or biogeography. Unlike MrBayes and BEAST, which are narrowly focused on phylogenetic inference and would require us to switch to yet another piece of software (perhaps R packages such as geiger or BioGeoBEARS) for this one last lab, RevBayes tries to be a one-stop shop for everything phylogenetic, and provides a framework that is general enough to make such analyses possible. The undoubtedly not-so-easy effort of learning how to use RevBayes would pay off, because once the program was introduced, the students could just keep using it for everything we were about to throw at them.
Or at least, that was the idea; the final lab never actually took place. In part, that was because I realized (more on that below) that I had to dedicate an entire lab to introducing the basics of how RevBayes works before getting around to using it for tree inference. I can see how this might seem ironic, or even like an admission that RevBayes was more trouble than it was worth in the end. However, I don’t feel too bad about it: we were already falling behind the schedule in general, including with the lectures, so it’s not like the extra lab held up the whole course and made it impossible to cover material that would have been covered otherwise.
Once the decision was made to go with RevBayes for the Bayesian labs, my initial plan was to start with Tracy Heath’s archery tutorial, which had constituted my own introduction to RevBayes back in the halcyon days of October 2018. Fortunately, I had the good sense to ask for recommendations in the RevBayes developer Slack, and immediately received excellent advice, tips, and suggestions from Jeremy Brown, Jiansi Gao, Bruno Petrucci, Orlando Schwery, Carrie Tribble, Rachel Warnock, and April Wright – it really drove home the value of using software that has a lively, engaged developer community behind it. A number of people kindly shared their own lecture and workshop slides with me, many of which I shamelessly copied (with acknowledgments, of course – see the blue text in the handouts below). Even more importantly, I was able to rely on the experience of people who had taught RevBayes to undergrads and grad students before, and prepare myself for those aspects of the process that were most likely to pose difficulties.

One piece of feedback that pretty much everyone agreed on was that the archery tutorial was still a little too advanced to serve as a general introduction to RevBayes, and that my original idea of trying to keep things simple by not going into the distinction between constant, deterministic, and stochastic variables (as well as their respective assignment operators, `<-`, `:=`, and `~`) was misguided. On the contrary, the best way forward was to spend as much time as necessary on these seeming details to make sure that the students were comfortable both with the underlying concepts from probability theory, and with representing them using the Rev syntax. In fact, the best approach would be to teach Rev much as one would any other programming language, introducing the key pieces of syntax one by one. It was for this reason that I decided to alter the original lab schedule and spend one entire meeting introducing Rev as gently as I possibly could, giving rise to Lab 6:
{{< embed-pdf url="Lab6.pdf" >}}
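To give a flavor of the distinction the lab dwells on, here is a rough Python analogy for the three Rev node types and their assignment operators (illustrative only – Rev’s actual semantics are richer than this sketch suggests):

```python
import math
import random

# Constant node (Rev: `x <- 1.0`): a plain value that never changes on its own.
x = 1.0

# Deterministic node (Rev: `y := 2 * x`): a function of its parents,
# recomputed whenever they change. A zero-argument lambda fakes this behavior.
y = lambda: 2 * x

# Stochastic node (Rev: `z ~ dnExponential(1.0)`): a random variable with an
# associated density; here we draw a value and compute its log-density by hand.
z = random.expovariate(1.0)
log_density = math.log(1.0) - 1.0 * z  # log f(z) for Exponential(rate = 1)

print(y())  # 2.0
x = 3.0
print(y())  # 6.0 -- the "deterministic node" tracks its parent
```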
Some observations on the handout above:
- […] functions (`func(obj)`) and methods (`obj.method()`), or the syntax for string concatenation (`dir = basedir + "RevBayes/Is/Awesome"`).

The following week, it was time to introduce time-free Bayesian phylogenetic inference, but I still wanted to go over the archery tutorial first to make sure everyone understood exactly how the Metropolis-Hastings algorithm worked. This resulted in an extremely long handout, but I don’t actually remember the meeting in question feeling especially rushed; two hours were enough to cover both the archery portion and the phylogram inference portion of the lab. In any case, I hope the pedagogical benefits were worth having the handout run to 18 pages. When it comes to learning basic MCMC theory, there’s nothing quite like building a toy sampler yourself – this is something that Graham impressed on me early on, since it was one of the first things he had me do as his grad student.
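In the spirit of that exercise, the whole toy sampler fits in a few lines. This is a minimal Metropolis sketch in Python (my own illustration, not the Rev code from the lab), targeting a standard normal distribution with a symmetric sliding-window proposal:

```python
import math
import random

def metropolis(log_target, x0, step, n_iter, seed=1):
    """Toy Metropolis sampler with a symmetric uniform sliding-window proposal."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(n_iter):
        x_new = x + rng.uniform(-step, step)           # propose a new value
        log_ratio = log_target(x_new) - log_target(x)  # symmetric proposal, so
        if math.log(rng.random()) < log_ratio:         # accept w.p. min(1, ratio)
            x = x_new
        samples.append(x)
    return samples

# Target: standard normal (log-density up to an additive constant)
log_norm = lambda v: -0.5 * v * v
chain = metropolis(log_norm, x0=5.0, step=1.0, n_iter=20000)
burnin = chain[2000:]
print(sum(burnin) / len(burnin))  # should be close to the target mean of 0
```

Because the proposal is symmetric, the Hastings correction cancels and only the ratio of target densities matters – which is exactly the special case the archery tutorial builds on.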
{{< embed-pdf url="Lab7.pdf" >}}
Notes:
- […] `RealPos`, but when we draw from `dnUniform`, the resulting stochastic variable is inferred to be of the most restrictive type possible based on the bounds of the distribution. Therefore, drawing from `dnUniform(0, 0.5)` does not produce a `RealPos` but rather a `Probability`, which is a subtype of `RealPos` with the domain restricted to \([0, 1]\). I reported this as issue #308 specifically because the students ran into it when trying to complete the exercise, and I regret to say that two years later, it is still unresolved.4 If I were to rewrite the handout, I’d just have the students experiment with a completely different prior – e.g., `dnExponential(5)`.

The final lab of the course was dedicated to tip-dating under the fossilized birth-death (FBD) process, and I tried to build as much as possible on the previous two labs when writing the corresponding handout. I mostly did this by exploiting what Warnock & Wright (2020) called the tripartite model, in which the inference of time-calibrated phylogenies can be decomposed into three mutually independent submodels: the substitution model, which describes how characters change along the branches of the tree; the clock model, which describes how the rate of the substitution process varies across these branches; and the tree model, which jointly specifies the topology of the tree and its branch lengths. As stated in the handout, the switch from time-free inference to tip-dating involves replacing the tree model and adding the clock model, which made it possible to re-use a good amount of the code that I’d gone over the week before:
{{< embed-pdf url="Lab8.pdf" >}}
Notes:
- […] `mvSlide` with `delta=0.01`, `delta=0.1`, and `delta=1`. That’s definitely one way to do it, but if I were to write the handout again, I would instead recommend using the slice move (`mvSlice`), which automatically chooses its step sizes based on the shape of the likelihood function, so it can propose small steps in some regions of parameter space and large steps in others. As a result, the acceptance rate of the slice move is close to 1, which makes for very efficient exploration.
- […] `85cbe87`), but it will keep triggering the warning added in PR #559. This is because the analysis will often sample trees with a sampled ancestor at the root (i.e., trees in which a single taxon is ancestral to all others), and the `mvRootTimeSlideUniform` move, which operates on the age of the root, will attempt to shift its age outside the bounds specified in the taxon file (see the discussion under issue #544). The simplest way to avoid this warning being printed to the screen several hundred thousand times is just to delete or comment out the offending move.
- […] or of clamping it to 0 is not an easy task. See the official FBD tutorial for our current best shot at it.
- […] `RealPos`, but RevBayes quite reasonably cannot guarantee that the result of subtracting one nonnegative real number from another will also be nonnegative. Therefore, the type of \(358.9 - 318.1\) is `Real`, and the whole expression `(358.9 - 318.1)/qexp(0.95)` evaluates to a `Real` as well. But unfortunately, that means the reciprocal of the resulting value can no longer be used as the rate parameter of an exponential distribution, which has to be of type `RealPos`. We can circumvent this problem by writing `abs(358.9 - 318.1)/qexp(0.95)`, abusing the absolute value function as a type cast from `Real` to `RealPos`.

Overall, I had a good time teaching RevBayes, and I’m happy to say that two years later, I still stand by my assessment that it was the right choice for this type of course. That doesn’t mean I think it would be the best choice for every course. Earlier this year, Diego Pol, with whom I had the pleasure to collaborate on the description of Koleken, asked me what software I’d recommend for Bayesian tip-dating if the goal was to design a lab for undergrads that was as simple and user-friendly as possible. Given those criteria, I had to recommend MrBayes over RevBayes. However, when there is enough time to spend on MCMC theory, probabilistic graphical models, and the Rev syntax, using RevBayes in a classroom setting can be a rewarding experience.
My single biggest worry was that the students would come to dislike RevBayes for being too complicated and wonky. Surprisingly, that proved not to be the case, and I was repeatedly told that the RevBayes-based Bayesian labs were smooth sailing compared to the maximum-likelihood lab we’d done one week earlier. Was this because learning how to interpret probabilistic graphical models and code them up in a custom domain-specific programming language is actually pretty easy? Not necessarily, I’m afraid. It’s just that the maximum-likelihood lab turned out to be particularly frustrating for a reason that ultimately had to do with the highly technical problem of partitioning morphological datasets by the observed number of character states.
The notion that partitioning by state number is important in model-based analyses of discrete morphological data is, as far as I can tell, one of those ideas that just sort of float around in the community before someone finally takes it upon themselves to formally demonstrate it in print in a rigorous manner. Brief discussions of it could be found in the literature for some time. Gavryushkina et al. (2017) found that partitioning by state number improved model fit in terms of marginal likelihoods relative to an alternative treatment (default for most software; see below) where a single substitution model is applied to all characters, and the size of its instantaneous rate matrix \(Q\) is determined by the maximum of the state numbers observed in the dataset. King et al. (2017) agreed that partitioning by the number of states was desirable but argued that the lower stationary frequencies for characters with higher state numbers need to be countered by increasing exchangeabilities in order to prevent changes in such characters from being artificially upweighted – an argument I’m not sure I buy.5 More recently, Ruebenstahl et al. (2024) discussed the issue in the supplement of their paper on the evolution of sea scorpions. However, it was Khakurel et al. (2024) who finally tackled the issue head-on, and performed a detailed simulation study showing that the use of a single \(Q\)-matrix which is too large for most characters in the matrix leads to systematic underestimation of branch lengths.
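The intuition behind that underestimation is easy to verify with a back-of-the-envelope calculation (my own illustration, not taken from Khakurel et al.): under a \(k\)-state Mk (Jukes-Cantor-type) model, the expected proportion of changed characters along a branch of length \(t\) is \(p = \frac{k-1}{k}\left(1 - e^{-kt/(k-1)}\right)\), which we can invert to see what branch length a given amount of observed change implies:

```python
import math

def mk_distance(p, k):
    """Branch length implied by a proportion p of differing characters
    under a k-state Mk model (the inverse of the JC-type correction)."""
    return -((k - 1) / k) * math.log(1 - k * p / (k - 1))

p = 0.3  # say 30% of (truly binary) characters differ between two taxa
print(mk_distance(p, 2))  # correct model (k = 2): about 0.46
print(mk_distance(p, 5))  # oversized matrix (k = 5): about 0.38 -- too short
```

The same observed amount of change maps to a shorter branch when \(k\) is inflated, because a bigger state space makes multiple hits (and thus hidden changes) look less likely.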
If we’ve established that partitioning by the number of states matters, how hard is it to implement this option in existing phylogenetic software? It turns out the answer ranges from “easy” to “impossible”:
RAxML: A single \(Q\)-matrix is applied to the whole dataset, and its size is automatically determined based on the highest-numbered state therein. Manual partitioning might be an option, but since RAxML makes it impossible to treat some multistate characters as ordered and others as unordered, it’s just poorly suited to analyses of real-world morphological datasets in general.
RAxML-NG: The default behavior is the same, but flexibility has increased. The partition file can now directly specify substitution models (not just data types), and instead of a single `MULTI` data type, we get fine-grained control over the number of states \(k\). In theory, this makes it possible to create partition files like this:

```
BIN+G+ASC_LEWIS, binary = 1-322
MULTI3_MK+G+ASC_LEWIS, threestate = 323-357
MULTI4_MK+G+ASC_LEWIS, fourstate = 358-377
MULTI5_MK+ASC_LEWIS, fivestate = 378-381
```
Unfortunately, real-world morphological datasets are never organized by the number of observed states, which means that we’re more likely to end up with this abomination:

```
BIN+G+ASC_LEWIS, binary = 1, 3-11, 13-22, 24-30, 32-34, 37, 38, 40-45, 47-51, 53-57, 59-62, 64, 65, 67-75, 77-87, 89-91, 93-112, 114-122, 126-131, 133, 135-141, 143-149, 152-162, 164, 165, 167-192, 194, 195, 197-200, 202, 206-214, 216-221, 223-225, 228-237, 239-242, 244-246, 248-252, 254-267, 269-273, 275, 276, 278-291, 293-295, 297-301, 303-305, 307-310, 312-319, 321-327, 329-330, 332-342, 345-346, 348-352, 354, 356-360, 362-365, 367-374, 376-381
MULTI3_MK+G+ASC_LEWIS, threestate = 2, 12, 35, 36, 39, 46, 52, 58, 63, 76, 88, 92, 113, 125, 151, 163, 166, 201, 203-205, 222, 226, 243, 247, 253, 277, 302, 306, 331, 343, 344, 353, 366, 375
MULTI4_MK+G+ASC_LEWIS, fourstate = 23, 31, 123, 124, 132, 134, 150, 193, 196, 215, 227, 238, 268, 274, 292, 296, 311, 320, 328, 347, 355
MULTI5_MK+ASC_LEWIS, fivestate = 66, 142, 227, 361
```

which, to add insult to injury, is non-trivial to produce, since we first need to write some sort of state-counting script.
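A minimal version of such a state-counting script might look like the following (a Python sketch under the assumptions that the matrix has already been parsed into a taxon-to-string mapping and that missing data are coded as `?` or `-`; the tiny matrix is a made-up example):

```python
# Toy character matrix: taxon name -> string of character states.
matrix = {
    "t1": "00120",
    "t2": "11031",
    "t3": "1210?",
}

def partition_by_state_count(matrix):
    """Group 1-based character indices by the number of observed states,
    ignoring missing data ('?' and '-')."""
    rows = list(matrix.values())
    by_count = {}
    for i in range(len(rows[0])):
        states = {row[i] for row in rows} - {"?", "-"}
        by_count.setdefault(len(states), []).append(i + 1)
    return by_count

def ranges(idx):
    """Collapse a sorted list of indices into RAxML-NG-style range notation."""
    out, start, prev = [], idx[0], idx[0]
    for i in idx[1:]:
        if i == prev + 1:
            prev = i
        else:
            out.append(f"{start}-{prev}" if prev > start else str(start))
            start = prev = i
    out.append(f"{start}-{prev}" if prev > start else str(start))
    return ", ".join(out)

for k, chars in sorted(partition_by_state_count(matrix).items()):
    model = "BIN+ASC_LEWIS" if k == 2 else f"MULTI{k}_MK+ASC_LEWIS"
    print(f"{model}, {k}state = {ranges(chars)}")
```

For the toy matrix above, this prints one partition line per observed state count, in exactly the format shown earlier.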
MrBayes: The dataset is automatically partitioned by the number of observed states, so that each partition receives a \(Q\)-matrix of appropriate size, and the resulting partitions are implicit. This is extremely convenient for the user, but also rather limiting: it is impossible to unlink rate variation or clock models across such partitions, or to assign them different ascertainment bias corrections. The only way around this limitation is to once again specify the partitions manually.
RevBayes: Partitioning by state number is not automatic, but easy to implement, and we don’t have to know the indices of individual characters with a given number of states in advance. The only thing we need is the maximum number of states observed in the dataset (denoted `max_state_num` in the code below), which we can usually extract from the Nexus file anyway:

```
idx = 1
max_state_num = 5

# Start iterating from 2 since we assume there are no 1-state (constant) characters
for (i in 2:max_state_num) {
    # Make a copy of the full character matrix ('morpho')
    chars[i - 1] <- morpho
    # Only keep those characters whose state space size equals i
    chars[i - 1].setNumStatesPartition(i)
    # Get the number of characters with i states
    nc = chars[i - 1].nchar()
    # If this number is greater than zero, create the appropriate Q-matrix
    if (nc > 0) {
        print("There are "+nc+" characters with "+i+" states.")
        Qmat[idx] <- fnJC(i)
        # Draw from the phylogenetic continuous-time Markov chain
        ctmc[idx] ~ dnPhyloCTMC(tree=tree, Q=Qmat[idx], branchRates=branch_rates,
                                siteRates=char_rates, type="Standard",
                                coding="variable")
        # Clamp to the data
        ctmc[idx].clamp( chars[i - 1] )
        # Increment counter
        idx = idx + 1
    }
}
```

Note that the resulting partitions are explicit – if we wanted to, we could use the `idx` index to give them unlinked rate variation or clock models, or different ascertainment bias corrections.
BEAST 2: The dataset is automatically partitioned by the number of observed states, so that each partition receives a \(Q\)-matrix of appropriate size, and the resulting partitions are explicit, meaning that we can assign them different models if we so choose. Credit where credit is due – this is the best behavior out of all phylogenetic programs I’ve used, hands down.
But wait! I still haven’t mentioned IQ-TREE, the software I decided to use for the maximum-likelihood lab. Well, IQ-TREE is worse in this regard than RAxML-NG, which is already less than ideal. The basic idea is the same, but subject to the further constraint that we can’t even mix binary and multistate characters in the same file. Therefore, if we try to translate our RAxML-NG partition file into a format recognizable by IQ-TREE:
```
#nexus
begin sets;
    charset binary = 1 3-11 13-22 24-30 32-34 37 38 40-45 47-51 53-57 59-62 64 65 67-75 77-87 89-91 93-112 114-122 126-131 133 135-141 143-149 152-162 164 165 167-192 194 195 197-200 202 206-214 216-221 223-225 228-237 239-242 244-246 248-252 254-267 269-273 275 276 278-291 293-295 297-301 303-305 307-310 312-319 321-327 329-330 332-342 345-346 348-352 354 356-360 362-365 367-374 376-381;
    charset threestate = 2 12 35 36 39 46 52 58 63 76 88 92 113 125 151 163 166 201 203-205 222 226 243 247 253 277 302 306 331 343 344 353 366 375;
    charset fourstate = 23 31 123 124 132 134 150 193 196 215 227 238 268 274 292 296 311 320 328 347 355;
    charset fivestate = 66 142 227 361;
    charpartition mine =
        JC2+ASC+G: binary,
        MK+ASC+G: threestate,
        MK+ASC+G: fourstate,
        MK+ASC: fivestate;
end;
```
we get the following error:
```
Alignment most likely contains binary sequences
ERROR: Sequence <taxon0> has invalid character 2 at site 2
ERROR: Sequence <taxon1> has invalid character 2 at site 2
ERROR: Sequence <taxon2> has invalid character 2 at site 2
<...>
```
The only thing that can save us here is the fact that IQ-TREE also allows individual partitions to come from separate files. This means we have to break up our character matrix into multiple files, each of which contains only those characters that share the same number of observed states:
```
#nexus
begin sets;
    charset binary = MK2.phy;
    charset threestate = MK3.phy;
    charset fourstate = MK4.phy;
    charset fivestate = MK5.phy;
    charpartition mine =
        JC2+ASC+G: binary,
        MK+ASC+G: threestate,
        MK+ASC+G: fourstate,
        MK+ASC: fivestate;
end;
```
And again, this is all ignoring the fact that in practice, we will likely be dealing with a mix of ordered and unordered characters. So we have to split the dataset by character type first, and by state number second, to end up with something like this:
```
#nexus
begin sets;
    charset part1 = MK2.phy;
    charset part2 = MK3.phy;
    charset part3 = MK4.phy;
    charset part4 = MK5.phy;
    charset part5 = ORDERED3.phy;
    charset part6 = ORDERED4.phy;
    charpartition mine =
        JC2+ASC+G: part1,
        MK+ASC+G: part2,
        MK+ASC+G: part3,
        MK+ASC: part4,
        ORDERED+ASC+G: part5,
        ORDERED+ASC: part6;
end;
```
My bright idea for how to keep the IQ-TREE exercise both manageable and theoretically rigorous was to provide the students with an R script that would take as its input a character matrix in Nexus format and a vector of indices of the characters to be ordered, and return partition-specific Phylip data files plus a Nexus partition file (similar to the one shown above) as its output. This way, the ugly preprocessing step would be taken care of, and the students could just focus on experimenting with different settings when executing the program from the command line.
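My script was written in R, but its core logic is easy to sketch in Python (a simplified sketch with made-up file names, operating on an already-parsed matrix and making no attempt at robust Nexus parsing):

```python
def split_for_iqtree(matrix, ordered):
    """Split a character matrix into per-partition Phylip-style blocks and
    build the matching Nexus partition file for IQ-TREE. `matrix` maps taxon
    names to state strings; `ordered` holds the 1-based indices of the
    characters to be treated as ordered."""
    taxa = list(matrix)
    nchar = len(matrix[taxa[0]])
    groups = {}
    for i in range(1, nchar + 1):
        # Count observed states, ignoring missing data
        states = {matrix[t][i - 1] for t in taxa} - {"?", "-"}
        kind = "ORDERED" if i in ordered else "MK"
        groups.setdefault((kind, len(states)), []).append(i)
    files, charsets, models = {}, [], []
    for n, ((kind, k), cols) in enumerate(sorted(groups.items()), start=1):
        fname = f"{kind}{k}.phy"
        body = "\n".join(
            f"{t}  {''.join(matrix[t][c - 1] for c in cols)}" for t in taxa
        )
        files[fname] = f"{len(taxa)} {len(cols)}\n{body}\n"
        charsets.append(f"charset part{n} = {fname};")
        model = "ORDERED" if kind == "ORDERED" else ("JC2" if k == 2 else "MK")
        models.append(f"{model}+ASC: part{n}")
    nexus = ("#nexus\nbegin sets;\n" + "\n".join(charsets)
             + "\ncharpartition mine =\n" + ",\n".join(models) + ";\nend;")
    return files, nexus

files, nexus = split_for_iqtree(
    {"t1": "00120", "t2": "11031", "t3": "1210?"}, ordered={2}
)
print(nexus)
```

The returned dictionary maps each partition file name to its contents, so writing the actual files is a matter of a final loop over `files.items()`.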
The initial version of my script failed for pretty much everyone, and in a unique way in each and every case. I spent a whole week answering emails from students who were understandably anxious that the very first step of their analysis, and the prerequisite for everything else they were supposed to do in their assignment, was failing with error messages they had trouble deciphering. In a few cases, this turned out to be due to genuine mistakes in their Nexus files – e.g., using illegal symbols in taxon names, or specifying the wrong `NCHAR` value. More often, however, it turned out that my script made way stronger assumptions about what a matrix file should look like than the actual requirements specified by the Nexus standard, and couldn’t deal with perfectly normal variation in things like leading tabs, blank lines after commands, or the delimiter used to separate the taxon names from their corresponding character state strings. Some of the problems I should have definitely foreseen (e.g., what if there are no ordered characters?), but others really were unexpected (one student had a matrix consisting of only ordered characters). All in all, this definitely made me appreciate the Nexus parsers integrated in software like RevBayes and IQ-TREE, which effortlessly handle everything you throw at them. You can see the final outcome of my efforts to robustify the partitioning script here, but needless to say, it is still pretty fragile: it relies on features that the Nexus standard explicitly says are optional, like the `NTAX` command, and makes all sorts of implicit assumptions about the `MATRIX` block.
This is, in fact, an important – and I think underappreciated – cultural distinction between morphological and molecular phylogenetics. For people with a molecular background, phylogenetics mostly begins once the dataset has been assembled, and the question arises of what to do with it. Textbooks like Felsenstein (2004) or Yang (2014) don’t start with instructions on how to operate an Illumina sequencer, and it would be strange if they did (though see Bleidorn 2017). It is only with the problem of co-inferring alignment and phylogeny – a practice long recognized as desirable but often computationally prohibitive – that the dataset is not treated as a given. On the other hand, paleontologists sometimes seem to believe almost the exact opposite: that phylogenetics ends with the assembly of the character matrix, and once that crucial step has been accomplished, it matters little what methods one uses to analyze it. This belief (which Graham’s course did its best to combat) is usually justified by the argument that if the matrix is “flawed” or “miscoded”, no method will be able to extract useful phylogenetic information from it – often summed up by the catchphrase “garbage in, garbage out”. Exactly what that implies about the readiness of paleontologists to believe that their colleagues are in the habit of producing garbage is left as an exercise to the reader.↩︎
To my half-amusement, half-dismay, this recently became a real paper: see Nazari et al. (2024).↩︎
With a big hit to performance, of course, due to the overhead introduced by the interpreter.↩︎
In part, this probably reflects the fact that the issue is just a particular instance of a more general problem, namely that Rev is strongly but implicitly typed. The choice is certainly justifiable, as it makes it easy for Rev to interface with the C++ back end, while keeping the front end simple for biologists who are most likely to be familiar with dynamically typed languages like R and Python. However, in cases like this, it can feel like the worst of both worlds. Even though every probability is also a nonnegative real (“RealPos” in Rev parlance), `Dist_unifProbability` does not derive from `Distribution__RealPos`, so you can’t plug it into `dnUniformTopologyBranchLength`, which expects the latter – something that wouldn’t happen in a weakly typed language. You’d be forgiven for thinking that the next best thing would be to explicitly cast 0.5 as a `RealPos` rather than a `Probability`, but alas, you don’t get explicit casts with implicit typing. (In reality, it’s even worse: Rev actually does have a function-style cast for converting from `RealPos` to `Probability`, but not one to go from `Probability` to `RealPos`.)↩︎
I’ve taken the liberty of discussing it with the authors of Khakurel et al. (2024), and I don’t think the proposed solution would work even if the problem were real, which is also somewhat questionable.↩︎
Dave Swofford has been promising to make PAUP* open-source for many years now. Here’s hoping it actually happens someday.↩︎